Abstract

The IUPAC International Chemical Identifier (InChI) provides a method to generate a unique text
descriptor of molecular structures. Building on this work, we report a process to generate a unique
text descriptor for reactions, RInChI. By carefully selecting the information that is included and by
ordering the data carefully, different scientists studying the same reaction should produce the same
RInChI. If differences arise, these are most likely to be in the minor layers of the InChI, and so may
be readily handled. RInChI provides a concise description of the key data in a chemical reaction, and
will help enable the rapid searching and analysis of reaction databases.



Background 

Since its inception, the IUPAC International Chemical Identifier (InChI) [1,2] has found wide
acceptance as a standard in the chemical community. In order to widen the applicability of the
identifier, the IUPAC Division VIII Subcommittee and the InChI Trust [3] have initiated several
projects to extend the usage of the identifier. Among these is the development of a non-proprietary,
international identifier for reactions (RInChI) [4] to describe chemical reactions in a unique
machine-readable character string, based on the InChI algorithm and suitable for data storage and
indexing. For this purpose, a working group was established in 2008 and the initial developmental
work was carried out at Cambridge University under the supervision of Jonathan Goodman, resulting in
a preliminary working version of the program. This note is an interim report based on the
discussions of the working group, the work on the project carried out by Chad Allen [5] and others in
the Goodman group, and a presentation by Guenter Grethe at the 8th German Conference on
Chemoinformatics [6]. Further work will be carried out before publication of the RInChI standard.



Introduction 

A number of methods are available to represent molecular structures as a single line of text. The
most commonly used of these are SMILES, developed by Daylight Chemical Information Systems, Inc. [7],
and the IUPAC International Chemical Identifier (InChI). Different researchers investigating the same
molecular structure should be able to write down the same InChI and the same canonical SMILES without
needing to consult each other. It would be very useful to be able to do the same thing for reactions.
However, comparing reactions is much more challenging than comparing structures, as more information
is available and decisions have to be made about which aspects of this information must be stored.

Daylight [7] has developed SMILES so that they can be used to describe reactions and SMILES to
describe transformations [7]. The Sybyl Line Notation (SLN) [8] can also be used to represent
chemical reactions in a line notation. Both of these approaches are powerful and flexible, permitting
the inclusion of a range of information including atom-mapping. Both are excellent tools to describe
reactions. However, different researchers studying the same reaction may well select different data
to include in the line notation, and so generate different descriptions of one reaction.

The objective of the RInChI project is the creation of an unambiguous description for reactions from
their structural diagrams, Rxnfiles and RDfiles, for which different researchers should, so far as
possible, generate the same identifier for the same reaction. The generated identifier will allow the
organization and validation of new reaction databases and will enable the comparison of different
data sources. In line with the multi-layer concept of InChI, the basic RInChI must include, in
addition to the InChIs of reactants, products, solvents, and catalysts, information about
equilibrium, unbalanced or multi-step reactions. Furthermore, the format of the identifier has to be
open to include future information, such as reaction conditions and non-unique molecular entities.
Since the identifier can be quite long depending on the number of participating molecules, long and
short versions of RInChIKeys were developed. The RInChI project software is implemented as an
importable Python package, including usage scripts for conversion, addition and analysis.

RInChI format 

Full RInChI string 

Analogous to InChI, the RInChI format is a hierarchical, layered description of a reaction with
different levels. The RInChI of version 0.02 includes the RInChI label, three groups of molecules and
further information layers.

The label starts with the acronym RInChI, followed by the RInChI version number and the InChI version
number used to generate the molecule InChIs, separated by a period. In the example shown in Figure 1,
the label reads RInChI=0.02.1S, i.e. the RInChI version is 0.02 and the InChI version is 1S. The
RInChI version number will always have exactly one decimal point.

Three groups of molecules are described in the RInChI identifier, one group for each side of the
arrow and one group of molecules which are above, below or on both sides of the arrow, i.e. solvents
and catalysts. Each group is described as a list of InChIs which are sorted within a group. After
sorting the molecules within a group, the groups representing starting materials and products are
sorted using the unix 'sort' command. Valid RInChIs do not require all three groups to be present.
For example, a RInChI of a reaction without a known product and no information about
solvents/reagents would only show the first group. Individual InChIs within a group are separated by
a double slash "//" and the groups of molecules are separated by a triple slash "///". Since the
display of the first two groups in a RInChI does not indicate which one represents reactants or
products, directionality is shown by an additional layer: "/d+" indicates that reactants are followed
by products, "/d-" represents the reverse direction and "/d=" represents an equilibrium reaction.
Additional layers, for example information about reaction conditions, might be added in future
versions of the program.
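The layout described above can be taken apart with a few string operations. The following is a minimal sketch, not the project's own parser: the function names `parse_label` and `parse_rinchi` are hypothetical, and the code assumes only the separators stated in the text (a label before the first slash, "//" between InChIs, "///" between groups, and an optional trailing direction layer).

```python
def parse_label(label):
    """Split a label such as 'RInChI=0.02.1S' into (RInChI version, InChI version)."""
    prefix = "RInChI="
    if not label.startswith(prefix):
        raise ValueError("not a RInChI label: %r" % label)
    # The RInChI version has exactly one decimal point, so the last period
    # separates it from the InChI version.
    rinchi_version, inchi_version = label[len(prefix):].rsplit(".", 1)
    return rinchi_version, inchi_version


def parse_rinchi(rinchi):
    """Return (label, groups, direction) for a RInChI-style string.

    groups is a list of lists of InChI bodies; direction is 'd+', 'd-',
    'd=' or None when the direction layer is omitted.
    """
    label, _, body = rinchi.partition("/")
    parse_label(label)  # validates the label
    direction = None
    for layer in ("/d+", "/d-", "/d="):
        if body.endswith(layer):
            direction = layer[1:]
            body = body[: -len(layer)]
            break
    groups = [[m for m in group.split("//") if m] for group in body.split("///")]
    return label, groups, direction


# Placeholder letters stand in for real InChI bodies.
print(parse_rinchi("RInChI=0.02.1S/A///B//C/d+"))
```

Splitting on the triple slash before the double slash matters: doing it the other way round would break the group boundaries.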

For example, the reaction 1 → 2, catalysed by 3, would be represented by the RInChI:

RInChI=0.02.1S/group1///group2///group3/

Here group1, group2 and group3 are the lists of InChIs in 1, 2 and 3 respectively. If the starting
material, 1, includes several molecules, they would be listed in the order defined by the unix 'sort'
command, and separated by a double slash "//". Similarly, group2 may include several different
products, and group3 may include several catalysts and other substances which are present both at the
beginning and end of the reaction, such as solvents.

The order of group1 and group2 is determined by the unix 'sort' command. The RInChI as written above
does not distinguish between 1 → 2 and 2 → 1. This is because the direction of many reactions, such
as acetal formation/hydrolysis, is decided by the details of the conditions rather than the reagents.
The direction of the reaction can be indicated by a layer at the end of the RInChI: "/d+", "/d-" or
"/d=".
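The ordering rules can be illustrated with a short sketch. It is an illustration under stated assumptions rather than the RInChI program itself: `build_rinchi` is a hypothetical name, Python's `sorted` stands in for the unix 'sort' command (the two agree only for byte-wise ordering, as with `LC_ALL=C`), both sides of the reaction are assumed to be non-empty, and the inputs are InChI bodies, i.e. the text after "InChI=1S/".

```python
def build_rinchi(reactants, products, agents=(), equilibrium=False,
                 direction_known=True, label="RInChI=0.02.1S"):
    """Assemble a RInChI-style string from lists of InChI bodies (sketch only)."""
    if not reactants or not products:
        raise ValueError("this sketch assumes both sides of the reaction are given")
    # Sort the molecules within each group, then join them with a double slash.
    side_a = "//".join(sorted(reactants))
    side_b = "//".join(sorted(products))
    agent_group = "//".join(sorted(agents))
    # Sort the two sides; remember whether the reactants ended up first.
    if side_a <= side_b:
        first, second, direction = side_a, side_b, "d+"
    else:
        first, second, direction = side_b, side_a, "d-"
    if equilibrium:
        direction = "d="
    groups = [first, second]
    if agent_group:
        groups.append(agent_group)
    text = label + "/" + "///".join(groups)
    # The direction layer is omitted when the direction is uncertain.
    if direction_known or equilibrium:
        text += "/" + direction
    return text


# Placeholder letters stand in for real InChI bodies.
print(build_rinchi(["B"], ["A"]))            # products sort first, so "/d-" is used
print(build_rinchi(["A"], ["C", "B"], ["S"]))
```

Because the direction layer records which side was sorted first, the same pair of groups is written identically for the forward and the reverse reaction, and only the final layer changes.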

In this example (Figure 1), group1 is molecule A, group2 is molecules B and C, and group3 is omitted
as the reaction diagram does not include any information about solvents or catalysts. The direction
of the reaction is indicated by the "/d+" at the end of the string. The starting material is in
group1 and the products are in group2, because the starting material InChIs are sorted before the
product InChIs by the unix 'sort' command. Roughly 50% of RInChIs for which directionality is defined
are expected to have the products in group1 and the starting materials in group2. This is indicated
in the RInChI by the use of "/d-" in the final layer. However, there are likely to be many RInChIs
which represent equilibria with no preferred direction, or else reactions for which the
directionality is uncertain. In the latter case, a RInChI should be used in which the direction layer
is omitted, and such a string is a valid RInChI.

RInChI=0.02.1S/C8H12O2/c1-2-4-8-6(3-1)9-5-7(8)10-8/h6-7H,1-5H2/t6-,7+,8-/m1/s1///C8H14O3/c9-6-5-11-7-3-1-2-4-8(6,7)10/h6-7,9-10H,1-5H2/t6-,7+,8+/m1/s1//C8H14O3/c9-6-5-11-7-3-1-2-4-8(6,7)10/h6-7,9-10H,1-5H2/t6-,7-,8+/m1/s1/d+

Figure 1 RInChI format: Individual InChIs are identified in color, the directional label is black.
The colors are not part of the RInChI, and are included here only to highlight the different parts of
the string.

Figure 2 Long RInChIKey, versions A and B. A. Long RInChIKey =
aSA-EFKSL-ZISUZIXPPXXNPC-WDSKDSIN-N-XLYOFNOQVPJJNP-UHFFFAOY-M-RLWWHEFTJSHFRN-RITPCOAN-N. B. Long
RInChIKey = aSA-FEANN-ZISUZIXPPXXNPC-WDSKDSIN-N-XLYOFNOQVPJJNP-UHFFFAOY-M-RLWWHEFTJSHFRN-RITPCOAN-N.

RInChIKeys

Since full RInChI strings can be very long, it is useful to have access to a shortened version.
RInChIKeys are hashed representations of the parent RInChIs. They are not backwards-convertible.
However, they are useful for database manipulations. Two different types of RInChIKeys were
developed, a composite of individual InChIKeys (long form) and a hashed digest of the RInChI as a
whole (short form). Each type is available in two versions (A and B), with the latter containing
additional information. We expect version B to be more useful in both cases.

The RInChIKeys comprise sequences of letters separated by hyphens. We refer to each sequence as a
'block'.

Long RInChIKey

In the long RInChIKey all molecules in the reaction are encoded as separate InChIKeys and grouped in
the same way as the InChIs in RInChIs. As a result, the length of the key varies with the number of
molecules in the reaction.



Version A As shown in Figure 2A, the first block (group of letters) consists of three letters, of
which the first one represents the version identifier and the next two identify the constituent
InChIKeys. The second block, which is separated from the first block by a hyphen, is a hashed
representation of any additional reaction layers taken as a whole. The following blocks are groups of
InChIKeys for all of the molecules in the RInChI, following the same order as the molecules in the
original RInChI. The division between the groups, which is indicated by a triple slash "///" in the
RInChI, is marked in the RInChIKey by a double hyphen.

The directional information in the RInChI, if present, is encoded in block 2 and cannot be extracted
from the RInChIKey.

Version B Because the directional information may be useful, we also developed Version B of the long
RInChIKey. In this version, the first letter of block 2 is F, B, E or U, representing forward,
backward, equilibrium, or unspecified reactions, respectively. The remainder of block 2 is a hash of
the remaining additional reaction layer information. The directional information now allows
identifying or searching for sets of reactants, products or agents. All the other blocks are
identical in versions A and B.
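Reading the direction back from a version B key needs no hashing, only the first letter of block 2. A minimal sketch follows; the function name `direction_from_key` and the dictionary are hypothetical, and the letter codes are the four given in the text.

```python
# Letter codes as described for version B keys.
DIRECTION_CODES = {
    "F": "forward",
    "B": "backward",
    "E": "equilibrium",
    "U": "unspecified",
}


def direction_from_key(key):
    """Return the direction encoded in the first letter of block 2 of a version B key."""
    blocks = key.split("-")
    if len(blocks) < 2 or not blocks[1]:
        raise ValueError("key has no second block: %r" % key)
    code = blocks[1][0]
    if code not in DIRECTION_CODES:
        raise ValueError("unknown direction code %r" % code)
    return DIRECTION_CODES[code]


# Key B from Figure 2, as printed there.
key_b = ("aSA-FEANN-ZISUZIXPPXXNPC-WDSKDSIN-N-XLYOFNOQVPJJNP-"
         "UHFFFAOY-M-RLWWHEFTJSHFRN-RITPCOAN-N")
print(direction_from_key(key_b))
```

The sketch should only be applied to version B keys: in version A the second block is a hash of all reaction layers, so its first letter carries no separate meaning.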

Short RInChIKeys

The length of a long RInChIKey varies with the number of molecules included in the RInChI. For some
purposes, a fixed-length key is preferable, even though it can encode less information. We have,
therefore, also developed short RInChIKeys which are fixed-length, hashed representations of RInChIs.
They are generated directly from the RInChIs and do not use the InChIKeys of individual molecular
structures. Examples for both versions of the RInChIKey are shown in Figures 3 and 4.

Version A This version encodes the groups of structures in a RInChI as simple entities and uses the
naive hash described for version A of the long RInChIKey for the reaction layers, thereby neglecting
the layered character of the InChIs. The first two blocks are the same as the first two blocks of the
long RInChIKey. These are followed by exactly three more blocks, which encode the three groups of
molecules in the original RInChI. These blocks are present even if the group is empty. This leads to
completely different reactant and product blocks for the two enantiomers shown in Figure 3. Note that
the fifth block, corresponding to group3, is the same for both, because it is empty for both
reactions.

Figure 3 Short RInChIKeys, version A, for a pair of enantiomeric reactions. a. Short RInChIKey =
aSA-BQUB-IHEAXHQDHHDVWSD-BDQFIIFTLXHFxWF-EANNATPGMBMFBIQ. b. Short RInChIKey =
aSA-BQUB-DFUUJGAAOPALL-BIJIEHNYVARRSAG-EANNATPGMBMFBIQ.

Figure 4 Short RInChIKeys, version B, for a pair of enantiomeric reactions. a. Short RInChIKey =
bSA-BEANN-CPQZBLWAMR-DVCHMHGSMQ-EANNATPGMB-MIILF-MCLVE-NEANN. b. Short RInChIKey =
bSA-BEANN-CPQZBLWAMR-DVCHMHGSMQ-EANNATPGMB-MDSDX-MDUXS-NEANN.

Version B Version B again includes directionality in block 2, indicated by the first character (see
the section on Version B of the long RInChIKey), and reflects the layered character of the RInChI by
separating the InChIs into major and minor parts. The major parts, shown in blocks 3, 4 and 5,
represent separately hashed layers for chemical formula, connectivity, hydrogen and charge for the
three groups of molecules in the RInChI. Note that block 5 is the same for both reactions, as it is
empty for both. The three following blocks are derived from the minor layers, with the first
character of each block indicating the level of protonation. The two enantiomers in Figure 4 now
differ only in blocks 6 and 7 (highlighted), which include information about the stereochemistry.
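The block-wise behaviour of the two short-key versions can be checked directly on the keys printed in Figures 3 and 4. The helper below is a hypothetical illustration (the name `differing_blocks` is not part of the RInChI software); it simply compares two keys block by block and reports the 1-based positions that differ.

```python
def differing_blocks(key_a, key_b):
    """Return the 1-based positions of the blocks in which two keys differ."""
    blocks_a = key_a.split("-")
    blocks_b = key_b.split("-")
    if len(blocks_a) != len(blocks_b):
        raise ValueError("keys have different numbers of blocks")
    return [i for i, (a, b) in enumerate(zip(blocks_a, blocks_b), start=1) if a != b]


# Version A keys from Figure 3 (as printed) and version B keys from Figure 4.
fig3_a = "aSA-BQUB-IHEAXHQDHHDVWSD-BDQFIIFTLXHFxWF-EANNATPGMBMFBIQ"
fig3_b = "aSA-BQUB-DFUUJGAAOPALL-BIJIEHNYVARRSAG-EANNATPGMBMFBIQ"
fig4_a = "bSA-BEANN-CPQZBLWAMR-DVCHMHGSMQ-EANNATPGMB-MIILF-MCLVE-NEANN"
fig4_b = "bSA-BEANN-CPQZBLWAMR-DVCHMHGSMQ-EANNATPGMB-MDSDX-MDUXS-NEANN"

print(differing_blocks(fig3_a, fig3_b))  # whole-group blocks differ in version A
print(differing_blocks(fig4_a, fig4_b))  # only minor-layer blocks differ in version B
```

For a database search this means that a version B key can be matched on blocks 3 to 5 alone to find reactions that share connectivity but differ in stereochemistry.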

Since RInChIKeys omit a large amount of information, it must be possible for different reactions to
have the same RInChIKeys. However, the chances of this are very low. Only two InChIKey clashes have
been reported [9-12], despite the huge number of InChIKeys that have been generated. The RInChIKey is
larger than the InChIKey and so the proportion of clashes should be correspondingly lower. Clashes,
therefore, are likely to be exceedingly infrequent, but it is important to bear in mind that they are
possible.

Conversions 

The algorithms for the conversion of Rxnfiles or RDfiles to RInChIs or RInChIKeys are implemented as
Python scripts. The InChI-to-InChIKey algorithm, available within the official InChI software [1],
was adapted into a Python implementation to facilitate integration. The conversion can easily be
carried out using the web-based conversion tools (Figure 5) on the RInChI website at
http://www-rinchi.ch.cam.ac.uk.

Generation of a RInChI from a Rxnfile and reverse conversion

A sample conversion is shown in Figure 6. After generating and saving a Rxnfile from a structural
reaction diagram, the file is uploaded for conversion on the RInChI website. Users then have several
options to choose from. They can generate the basic RInChI, add the long and short RInChIKeys and
fill in auxiliary information.

In the reverse direction, a RInChI can be converted to a Rxnfile and the corresponding reaction
sketch using the Decoder tool of the website (Figure 7). RAuxInfo data have to be provided if the
Rxnfile should contain 2D coordinates. If this information is not available, ChemAxon's
MolConverter, provided with the RInChI software package, must be used to generate new 2D coordinates.

*A referee has pointed out that the order of sorting the reactants/products can depend on the minor
layers of the InChI, and so a small change in the minor layers of a molecule can have a dramatic
effect on the InChI key. This could be addressed by sorting first on major layers and then on minor
layers. We intend to address this issue in a future version of the RInChI protocol.

[Figure 5 shows two upload forms. "RInChIs from Rxnfiles": choose an rxnfile to upload (max size
100 kB), with options to generate RAuxInfo, the Long-RInChIKey and the Short-RInChIKey. "RInChIs from
RDfiles": choose an RDfile to upload (max size 2 MB). N.B. Reaction records in RDfiles often contain
data, such as catalysts and solvents, and the conversion program attempts to include this in the
output RInChIs. Any structures encountered which cannot be expressed in InChI format are displayed as
"X" within the RInChI.]

Figure 5 Web-based tools for the conversion of RxnFiles and RDFiles to RInChIs.

[Figure 6 shows the reaction diagram, the Rxnfile upload form, and the resulting RInChI (the string
given in Figure 1), RAuxInfo and Long-RInChIKey.]

Figure 6 Conversion of a sample Rxnfile to a RInChI and RInChIKeys using the RInChI website tool.
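Sorting first on major layers and then on minor layers can be sketched as a two-part sort key. This is an illustration of the idea only, not the protocol's eventual method: the names `split_major_minor` and `major_first_key` are hypothetical, the major layers are taken to be the formula plus the layers beginning with c, h and q (connectivity, hydrogen and charge, following the description of the version B short key), all other layers are treated as minor, and the input is assumed to be a standard InChI body without a fixed-hydrogen layer.

```python
MAJOR_PREFIXES = ("c", "h", "q")  # assumption: connectivity, hydrogen, charge


def split_major_minor(inchi_body):
    """Split an InChI body (text after 'InChI=1S/') into (major, minor) strings."""
    layers = inchi_body.split("/")
    formula, rest = layers[0], layers[1:]
    major = [formula] + [l for l in rest if l.startswith(MAJOR_PREFIXES)]
    minor = [l for l in rest if not l.startswith(MAJOR_PREFIXES)]
    return "/".join(major), "/".join(minor)


def major_first_key(inchi_body):
    """Sort key that compares major layers first and minor layers second."""
    return split_major_minor(inchi_body)


# The three molecules of Figure 1.
reactant = "C8H12O2/c1-2-4-8-6(3-1)9-5-7(8)10-8/h6-7H,1-5H2/t6-,7+,8-/m1/s1"
product_1 = "C8H14O3/c9-6-5-11-7-3-1-2-4-8(6,7)10/h6-7,9-10H,1-5H2/t6-,7+,8+/m1/s1"
product_2 = "C8H14O3/c9-6-5-11-7-3-1-2-4-8(6,7)10/h6-7,9-10H,1-5H2/t6-,7-,8+/m1/s1"

for body in sorted([product_2, reactant, product_1], key=major_first_key):
    print(split_major_minor(body))
```

With such a key, molecules that share formula and connectivity always sort next to each other, and a change confined to the stereochemical layers can only reorder molecules within that set.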

Generation of RInChI from RDfile

The conversion of large reaction databases to the corresponding database of RInChIs is fast and
reduces the size of the database by about 90% by eliminating most non-relevant information. The
conversion script extracts from a large RDfile the embedded Rxnfiles and the molfiles representing
agents, catalysts and solvents. The latter information is of special interest for identifying
variations of a given core reaction. Therefore, the program generates as many Rxnfiles from a
reaction as there are variations. An example of the conversion is shown in Figure 8, again using a
web-based conversion tool.

[Figure 7 shows the RInChI Decoder form: paste a RInChI and, optionally, RAuxInfo (separated by a
newline) to generate an rxnfile. N.B. The rxnfile generated by this form will not contain 2D
coordinate data if no RAuxInfo is provided. However, the downloadable RInChI software pack can
generate new 2D coordinates using ChemAxon's MolConverter.]

Figure 7 Web-based tool for decoding a RInChI to a Rxnfile.

RInChI databases can be easily manipulated for analysis (see the section RInChI applications). For
example, databases from different sources can be checked for duplicate reactions, and for reactions
using the same starting material or yielding the same product.



RInChI applications

Generation of a RInChI for multistep reactions

The web-based tool (Figure 9) allows the formation of a summary RInChI for multistep reactions from
the RInChIs of the individual steps. The RInChIs of each of the reactions have to be generated
separately and added into the box in the correct sequential order.



[Figure 8 shows the "RInChIs from RDfiles" upload form, a two-step reaction scheme with its reagents
and solvents, and the RInChI generated from the RDfile, which ends in the direction layer "/d+".]

Figure 8 Conversion of a sample RDfile into a RInChI.

[Figure 9 shows the "RInChI Addition" form, in which the RInChIs of the individual steps are entered
in the correct order, each beginning on a new line, together with the overall reaction and its
RInChI:]

RInChI=0.02.1S/C10H19NO/c12-9-4-1-2-5-10(9)6-3-7-11-8-10/h9,11-12H,1-8H2/t9-,10-/m0/s1///C3H4O/c1-2-3-4/h2-3H,1H2//C7H9N/c8-6-7-4-2-1-3-5-7/h1-5H,6,8H2//C9H14O3/c1-2-12-9(11)7-5-3-4-6-8(7)10/h7H,2-6H2,1H3///C17H21NO2/c19-15-9-4-5-10-17(15)11-6-12-18(16(17)20)13-14-7-2-1-3-8-14/h1-3,7-8H,4-6,9-13H2/d-

Figure 9 Generation of a RInChI for an overall reaction (multistep reaction).



The Python script then produces a RInChI for the overall reaction that shows the initial starting
material(s), the final product(s) and any starting material(s) or product(s) in the sequence of
reactions that have not been changed. Some detailed information about each step is lost when multiple
steps are combined, and the resultant RInChI cannot distinguish between reagents, solvents and
catalysts in intermediate steps of the overall process.
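One plausible way to combine steps is sketched below. It is not the addition script itself, whose exact rules are not given here: the function name `combine_steps` is hypothetical, each step is assumed to be supplied as a pair of lists of InChI bodies (reactants, products), and a species is treated as an intermediate when it is formed in one step and consumed in a later one.

```python
def combine_steps(steps):
    """Combine ordered reaction steps into overall (reactants, products).

    steps is a list of (reactants, products) pairs, each a list of InChI bodies.
    """
    overall_reactants = []
    pool = []  # species formed so far and not yet consumed
    for reactants, products in steps:
        for species in reactants:
            if species in pool:
                # Formed in an earlier step: an intermediate, so it cancels out.
                pool.remove(species)
            elif species not in overall_reactants:
                overall_reactants.append(species)
        for species in products:
            if species not in pool:
                pool.append(species)
    # Whatever is left in the pool was never consumed: the overall products.
    return sorted(overall_reactants), sorted(pool)


# Placeholder letters stand in for InChI bodies: A + B -> C, then C + D -> E.
steps = [(["A", "B"], ["C"]), (["C", "D"], ["E"])]
print(combine_steps(steps))
```

The sketch makes the loss of detail visible: once the steps are merged, nothing records in which step D was introduced or what role it played there.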



RInChI tools for analysis

Because they are smaller than RDfiles while still containing all essential chemical information,
RInChI databases are very well suited for large-scale analysis. At the time of writing, substance
searching and the analysis of changes in stereochemistry and rings have been implemented as Python
scripts to exemplify the potential of RInChIs. The analyses can easily be carried out using the
program's website.



[Figure 10 shows the "RInChI Searching" form: RInChI databases (such as those produced by the RDfile
converter above) can be searched for specific reagents in particular roles easily and quickly. The
user enters the InChI to search for, uploads a RInChI database and selects the roles to search for:
Reactant, Product, Equilibrium reagent or Reaction agent.]

Figure 10 Web-based searching for a product (shown) in a RInChI database.

[Figure 11 shows the "RInChI Analysis (cyclic changes)" form: RInChI databases can be analysed for
reactions creating or destroying rings. The user uploads a RInChI database and selects what to count:
Absolute change, Change per molecule or Change per cyclic molecule, with an option to list the
analysed RInChIs.]

Figure 11 Analysis of cyclic molecules in a reaction.



Searching for reaction partners 

RInChI databases can be searched for compounds taking part in a reaction as reactant, product, agent
or equilibrium agent. To search the database for the benzofuran derivative shown in Figure 10 as a
product, the InChI notation of the compound and the RInChI database to be searched have to be entered
into the respective boxes on the website. The result is a list of RInChIs of reactions that produce
the benzofuran derivative. From this list the individual Rxnfiles and, subsequently, the structural
diagrams of the reactions can be generated via the RInChI decoder utility (Figure 7).
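A role-based search over a list of RInChI strings can be sketched as follows. The functions `roles_of` and `search` are hypothetical names, not the website's scripts; the sketch assumes the string layout described earlier, takes InChI bodies (the text after "InChI=1S/") as input, and labels the first two groups "unspecified" when the direction layer is absent.

```python
ROLE_BY_DIRECTION = {
    "d+": ("reactant", "product"),
    "d-": ("product", "reactant"),
    "d=": ("equilibrium reagent", "equilibrium reagent"),
}


def roles_of(inchi_body, rinchi):
    """Return the set of roles that a molecule plays in one RInChI-style string."""
    body = rinchi.partition("/")[2]
    direction = None
    for layer in ROLE_BY_DIRECTION:
        if body.endswith("/" + layer):
            direction = layer
            body = body[: -len(layer) - 1]
            break
    groups = [group.split("//") for group in body.split("///")]
    groups += [[] for _ in range(3 - len(groups))]  # pad missing groups
    first_role, second_role = ROLE_BY_DIRECTION.get(
        direction, ("unspecified", "unspecified"))
    roles = set()
    for members, role in zip(groups, (first_role, second_role, "reaction agent")):
        if inchi_body in members:
            roles.add(role)
    return roles


def search(rinchis, inchi_body, role):
    """Return the RInChIs in which the molecule appears in the given role."""
    return [r for r in rinchis if role in roles_of(inchi_body, r)]


# Placeholder letters stand in for real InChI bodies.
database = [
    "RInChI=0.02.1S/A///B//C/d+",
    "RInChI=0.02.1S/B///D/d-",
    "RInChI=0.02.1S/A///D///B",
]
print(search(database, "B", "product"))
```

The second entry shows why the direction layer has to be consulted: B sits in the first group there, yet it is a product because the layer is "/d-".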



Structural analyses 

The potential of analyzing RInChIs is further demonstrated by two preliminary analytical web-based
tools which have been implemented in the RInChI program for certain structural changes in molecules
participating in a reaction. However, their full application is limited by the lack of stoichiometric
information in RInChIs.

One script searches a RInChI database for reactions in which the number of rings on either side of
the reaction changes. Additionally, it is possible to count the change in rings per molecule or rings
per cyclic molecule. This tool is based on the information contained in the connectivity layer of the
individual InChIs within a RInChI (Figure 11).

The second tool analyses the stereochemical information in the layers of individual InChIs within a
RInChI to calculate the changes in the number of stereocenters per molecule in a reaction (Figure
12).

[Figure 12 shows the "RInChI Analysis (stereochemical changes)" form: RInChI databases can be
analysed for reactions creating or destroying stereochemistry. The user uploads a RInChI database and
selects what to count: Defined centres only; All stereo centres, SP2 centres only or SP3 centres
only; Absolute change, Change per molecule or Change per stereospecific molecule; with an option to
list the analysed RInChIs.]

Figure 12 Stereochemical analysis of RInChIs.
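Both analyses can be approximated from the InChI text alone. The sketch below is a simplified illustration, not the project's scripts: the function names are hypothetical, rings are counted as bonds minus atoms plus one for each connected component of the connectivity (/c) layer, only tetrahedral centres marked "+" or "-" in the /t layer are counted, and InChIs with repeated components (written with "*") are not handled.

```python
import re


def layer(inchi_body, prefix):
    """Return the first layer that starts with prefix (prefix removed), or ''."""
    for part in inchi_body.split("/")[1:]:
        if part.startswith(prefix):
            return part[len(prefix):]
    return ""


def count_rings(inchi_body):
    """Count rings from the connectivity layer as bonds - atoms + 1 per component."""
    total = 0
    for component in layer(inchi_body, "c").split(";"):
        if not component:
            continue
        if "*" in component:
            raise ValueError("repeated components are not handled in this sketch")
        atoms, bonds, stack, attach = set(), set(), [], None
        for token in re.findall(r"\d+|[(),]", component):
            if token == "(":
                stack.append(attach)      # remember the branch atom
            elif token == ",":
                attach = stack[-1]        # next branch from the same atom
            elif token == ")":
                attach = stack.pop()      # continue the chain from the branch atom
            else:
                atom = int(token)
                atoms.add(atom)
                if attach is not None:
                    bonds.add(frozenset((attach, atom)))
                attach = atom
        total += len(bonds) - len(atoms) + 1
    return total


def count_defined_stereocentres(inchi_body):
    """Count /t layer entries that carry a defined parity ('+' or '-')."""
    entries = re.split(r"[,;]", layer(inchi_body, "t"))
    return sum(1 for entry in entries if entry.endswith(("+", "-")))


# Reactant and one product from Figure 1.
reactant = "C8H12O2/c1-2-4-8-6(3-1)9-5-7(8)10-8/h6-7H,1-5H2/t6-,7+,8-/m1/s1"
product = "C8H14O3/c9-6-5-11-7-3-1-2-4-8(6,7)10/h6-7,9-10H,1-5H2/t6-,7+,8+/m1/s1"
print(count_rings(product) - count_rings(reactant))
print(count_defined_stereocentres(product) - count_defined_stereocentres(reactant))
```

Because RInChIs carry no stoichiometry, such counts are most meaningful per molecule; totals over a whole side of the reaction depend on how many copies of each species actually take part.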

Database analysis 

In order to further these goals, four large RDfiles containing nearly three thousand reactions,
provided by Elsevier [13], FIZ Chemie Berlin [14], and InfoChem [15], were used for testing. With the
large database of RInChIs generated from these files, much more information on the strengths and
weaknesses of the format could be gleaned and general tools for RInChI manipulation developed.

These data sets were processed to generate 2900 RInChIs. The process took a few minutes on a desktop
computer. Most of the computer time was required for generating InChIs from the structures in the
RDfiles.

The file size was reduced by a factor of thirty moving from RDfiles to RInChI. Although 97% of the
size was lost, most reaction data were retained. By removing a lot of information without chemical
relevance, such as Cartesian coordinates, it is possible to manipulate and search the rest very
quickly, using simple unix commands.

This database of RInChIs could be analyzed very rapidly using simple text-handling tools. Sorting the
list showed that there were 298 duplicates. These turned out to be very similar processes which were
distinguished only by free-text comments in the RDfiles. They were slightly different, therefore, but
not different enough to have distinct RInChIs. The RInChI file contained 2602 unique reactions, in
which 7342 molecules were present. Comparing these molecules across the whole file showed that 5240
of them were unique. It was possible to quickly identify the examples for which the same starting
materials led to different products and different starting materials led to the same products.
Although this fairly small database did not lead to any startling new discoveries, it illustrates how
large amounts of chemical data can be compressed and analyzed effectively and cheaply with
scalability to much larger systems.
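The duplicate and molecule counts reported above are the kind of figures that simple text handling produces. A minimal sketch follows, assuming one RInChI string per entry and the layout described earlier; the function name `summarise` and the dictionary keys are hypothetical, and the example uses placeholder letters rather than the test database.

```python
from collections import Counter


def summarise(rinchis):
    """Count duplicate reactions and unique molecules in a list of RInChI strings."""
    counts = Counter(rinchis)
    unique = sorted(counts)
    molecules = []
    for rinchi in unique:
        body = rinchi.partition("/")[2]
        for layer in ("/d+", "/d-", "/d="):
            if body.endswith(layer):
                body = body[: -len(layer)]
                break
        for group in body.split("///"):
            molecules.extend(m for m in group.split("//") if m)
    return {
        "reactions": len(rinchis),
        "duplicates": sum(n - 1 for n in counts.values()),
        "unique_reactions": len(unique),
        "molecules": len(molecules),
        "unique_molecules": len(set(molecules)),
    }


# Placeholder letters stand in for real InChI bodies.
database = [
    "RInChI=0.02.1S/A///B/d+",
    "RInChI=0.02.1S/A///B/d+",
    "RInChI=0.02.1S/A///C//D/d+",
]
print(summarise(database))
```

Grouping the unique RInChIs by their reactant group or by their product group, instead of counting them, gives the lists of reactions in which the same starting materials lead to different products and vice versa.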

Conclusion 

This note outlines the initial development of a program to generate the non-proprietary
International Identifier for Reactions (RInChI). The identifier describes chemical reactions in a
unique, freely-available and machine-readable character string that can be used both in printed and
electronic data sources. The program is an extension of the IUPAC InChI project. A software package
has been developed to generate RInChIs and RInChIKeys from Rxnfiles and RDfiles and to regenerate
Rxnfiles from RInChIs. The package also includes several scripts to analyze databases for certain
reaction participants and structural changes in rings or in stereochemistry. All tools are web-based
and are available on the project's website at http://www-rinchi.ch.cam.ac.uk. The individual
web-based tools on the website are shown in the figures together with relevant examples. Further work
on the project under the supervision of the InChI Trust is continuing.